Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
Six young, healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E).
library(caret); library(dplyr); library(corrplot)
library(correlationfunnel); library(GGally); library(e1071)
# Load the data; blanks and "#DIV/0!" entries are treated as NA
TrainSet <- read.csv("pml-training.csv", stringsAsFactors = FALSE,
                     na.strings = c("", "NA", "#DIV/0!"))
TestSet <- read.csv("pml-testing.csv", stringsAsFactors = FALSE,
                    na.strings = c("", "NA", "#DIV/0!"))
dim(TrainSet); dim(TestSet)
## [1] 19622 160
## [1] 20 160
Splitting off a validation set:
set.seed(101)
inTrain <- createDataPartition(TrainSet$classe, p = 0.8, list = FALSE)
DatValue <- TrainSet[-inTrain,]
TrainSet <- TrainSet[inTrain,]
dim(TrainSet); dim(DatValue)
## [1] 15699 160
## [1] 3923 160
table(TrainSet$classe)/nrow(TrainSet)
##
## A B C D E
## 0.2843493 0.1935155 0.1744060 0.1638958 0.1838334
The class distribution is not heavily biased, so accuracy is a reasonable evaluation metric.
For Belt sensor:
belt_miss <- sapply(select(TrainSet,names(TrainSet)[grepl("_belt",names(TrainSet))]),
function(x) sum(is.na(x)))
belt_miss
## roll_belt pitch_belt yaw_belt
## 0 0 0
## total_accel_belt kurtosis_roll_belt kurtosis_picth_belt
## 0 15396 15413
## kurtosis_yaw_belt skewness_roll_belt skewness_roll_belt.1
## 15699 15395 15413
## skewness_yaw_belt max_roll_belt max_picth_belt
## 15699 15388 15388
## max_yaw_belt min_roll_belt min_pitch_belt
## 15396 15388 15388
## min_yaw_belt amplitude_roll_belt amplitude_pitch_belt
## 15396 15388 15388
## amplitude_yaw_belt var_total_accel_belt avg_roll_belt
## 15396 15388 15388
## stddev_roll_belt var_roll_belt avg_pitch_belt
## 15388 15388 15388
## stddev_pitch_belt var_pitch_belt avg_yaw_belt
## 15388 15388 15388
## stddev_yaw_belt var_yaw_belt gyros_belt_x
## 15388 15388 0
## gyros_belt_y gyros_belt_z accel_belt_x
## 0 0 0
## accel_belt_y accel_belt_z magnet_belt_x
## 0 0 0
## magnet_belt_y magnet_belt_z
## 0 0
For Arm sensor:
arm_miss <- sapply(select(TrainSet,names(TrainSet)[grepl("_arm",names(TrainSet))]),
function(x) sum(is.na(x)))
arm_miss
## roll_arm pitch_arm yaw_arm total_accel_arm
## 0 0 0 0
## var_accel_arm avg_roll_arm stddev_roll_arm var_roll_arm
## 15388 15388 15388 15388
## avg_pitch_arm stddev_pitch_arm var_pitch_arm avg_yaw_arm
## 15388 15388 15388 15388
## stddev_yaw_arm var_yaw_arm gyros_arm_x gyros_arm_y
## 15388 15388 0 0
## gyros_arm_z accel_arm_x accel_arm_y accel_arm_z
## 0 0 0 0
## magnet_arm_x magnet_arm_y magnet_arm_z kurtosis_roll_arm
## 0 0 0 15446
## kurtosis_picth_arm kurtosis_yaw_arm skewness_roll_arm skewness_pitch_arm
## 15448 15398 15445 15448
## skewness_yaw_arm max_roll_arm max_picth_arm max_yaw_arm
## 15398 15388 15388 15388
## min_roll_arm min_pitch_arm min_yaw_arm amplitude_roll_arm
## 15388 15388 15388 15388
## amplitude_pitch_arm amplitude_yaw_arm
## 15388 15388
For Forearm sensor:
forearm_miss <- sapply(select(TrainSet,
names(TrainSet)[grepl("_forearm",names(TrainSet))]),
function(x) sum(is.na(x)))
forearm_miss
## roll_forearm pitch_forearm yaw_forearm
## 0 0 0
## kurtosis_roll_forearm kurtosis_picth_forearm kurtosis_yaw_forearm
## 15448 15449 15699
## skewness_roll_forearm skewness_pitch_forearm skewness_yaw_forearm
## 15447 15449 15699
## max_roll_forearm max_picth_forearm max_yaw_forearm
## 15388 15388 15448
## min_roll_forearm min_pitch_forearm min_yaw_forearm
## 15388 15388 15448
## amplitude_roll_forearm amplitude_pitch_forearm amplitude_yaw_forearm
## 15388 15388 15448
## total_accel_forearm var_accel_forearm avg_roll_forearm
## 0 15388 15388
## stddev_roll_forearm var_roll_forearm avg_pitch_forearm
## 15388 15388 15388
## stddev_pitch_forearm var_pitch_forearm avg_yaw_forearm
## 15388 15388 15388
## stddev_yaw_forearm var_yaw_forearm gyros_forearm_x
## 15388 15388 0
## gyros_forearm_y gyros_forearm_z accel_forearm_x
## 0 0 0
## accel_forearm_y accel_forearm_z magnet_forearm_x
## 0 0 0
## magnet_forearm_y magnet_forearm_z
## 0 0
For Dumbbell sensor:
dumbbell_miss <- sapply(select(TrainSet,
names(TrainSet)[grepl("_dumbbell",names(TrainSet))]),
function(x) sum(is.na(x)))
dumbbell_miss
## roll_dumbbell pitch_dumbbell yaw_dumbbell
## 0 0 0
## kurtosis_roll_dumbbell kurtosis_picth_dumbbell kurtosis_yaw_dumbbell
## 15392 15390 15699
## skewness_roll_dumbbell skewness_pitch_dumbbell skewness_yaw_dumbbell
## 15391 15389 15699
## max_roll_dumbbell max_picth_dumbbell max_yaw_dumbbell
## 15388 15388 15392
## min_roll_dumbbell min_pitch_dumbbell min_yaw_dumbbell
## 15388 15388 15392
## amplitude_roll_dumbbell amplitude_pitch_dumbbell amplitude_yaw_dumbbell
## 15388 15388 15392
## total_accel_dumbbell var_accel_dumbbell avg_roll_dumbbell
## 0 15388 15388
## stddev_roll_dumbbell var_roll_dumbbell avg_pitch_dumbbell
## 15388 15388 15388
## stddev_pitch_dumbbell var_pitch_dumbbell avg_yaw_dumbbell
## 15388 15388 15388
## stddev_yaw_dumbbell var_yaw_dumbbell gyros_dumbbell_x
## 15388 15388 0
## gyros_dumbbell_y gyros_dumbbell_z accel_dumbbell_x
## 0 0 0
## accel_dumbbell_y accel_dumbbell_z magnet_dumbbell_x
## 0 0 0
## magnet_dumbbell_y magnet_dumbbell_z
## 0 0
Many of the features are almost 98% missing (15,388 or more of 15,699 rows), so it is better to drop them; after dropping these 100 columns and the identifier/timestamp columns, 52 predictors remain.
column_2drop <- c(names(belt_miss[belt_miss != 0]),
names(arm_miss[arm_miss != 0]),
names(forearm_miss[forearm_miss != 0]),
names(dumbbell_miss[dumbbell_miss != 0]))
length(column_2drop)
## [1] 100
# Drop the 100 mostly-missing columns and the identifier/timestamp columns
Anaylsis <- as_tibble(TrainSet %>% select(-all_of(column_2drop),
                             -c(X, user_name, raw_timestamp_part_1,
                                raw_timestamp_part_2, cvtd_timestamp,
                                new_window, num_window)))
Anaylsis$classe <- as.factor(Anaylsis$classe)
Anaylsis[,1:52] <- lapply(Anaylsis[,1:52],as.numeric)
dim(Anaylsis)
## [1] 15699 53
corr_col <- cor(select(Anaylsis, -classe))
diag(corr_col) <- 0
corr_col <- which(abs(corr_col)>0.8,arr.ind = T)
corr_col <- unique(row.names(corr_col))
corrplot(cor(select(Anaylsis, all_of(corr_col))),
         type = "upper", order = "hclust", method = "number")
The correlation plot makes it clear that many columns are highly correlated, and that high correlation occurs only between features from the same sensor, i.e. "belt", "arm", "forearm", and "dumbbell".
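The plot above only visualizes the correlation structure; the same `which(..., arr.ind = TRUE)` trick can also list the strongly correlated pairs by name. A minimal, self-contained base-R sketch on toy data (the columns a, b, c here are illustrative, not from the sensor dataset):

```r
# Toy data: a and b are near-duplicates, c is independent noise
set.seed(1)
x <- rnorm(100)
toy <- data.frame(a = x, b = x + rnorm(100, sd = 0.1), c = rnorm(100))

cm <- cor(toy)
diag(cm) <- 0                                  # ignore self-correlation
hits <- which(abs(cm) > 0.8, arr.ind = TRUE)   # indices of strong pairs

# each pair appears as (i, j) and (j, i); keep one orientation
hits <- hits[hits[, "row"] < hits[, "col"], , drop = FALSE]
data.frame(var1 = rownames(cm)[hits[, "row"]],
           var2 = colnames(cm)[hits[, "col"]],
           r    = round(cm[hits], 3))
```

Only the (a, b) pair survives the 0.8 threshold, mirroring how the sensor features cluster within a single sensor.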
We use correlationfunnel::correlate() to inspect the correlation of every feature with each level of "classe":
correlation_funl_df <- Anaylsis %>% binarize(n_bins = 4, thresh_infreq = 0.01)
correlation_a <- correlation_funl_df %>% correlate(target = classe__A)
correlation_a %>% plot_correlation_funnel(interactive = T,limits = c(-0.5,0.5))
For *classe__A* the “Arm and Forearm” sensors are more important.
correlation_b <- correlation_funl_df %>% correlate(target = classe__B)
correlation_b %>% plot_correlation_funnel(interactive = T,limits = c(-0.5,0.5))
For *classe__B* the “Dumbbell and Belt” sensors are more important.
correlation_c <- correlation_funl_df %>% correlate(target = classe__C)
correlation_c %>% plot_correlation_funnel(interactive = T,limits = c(-0.5,0.5))
For *classe__C* the “Dumbbell” sensors are more important.
correlation_d <- correlation_funl_df %>% correlate(target = classe__D)
correlation_d %>% plot_correlation_funnel(interactive = T,limits = c(-0.5,0.5))
For *classe__D* the “Forearm, Arm and Dumbbell” sensors are more important.
correlation_e <- correlation_funl_df %>% correlate(target = classe__E)
correlation_e %>% plot_correlation_funnel(interactive = T,limits = c(-0.5,0.5))
For *classe__E* the “Belt” sensors are more important.
The top 5 features for each class are collected in these vectors:
col_a <- c("magnet_arm_x", "pitch_forearm" , "magnet_dumbbell_y",
"roll_forearm", "gyros_dumbbell_y")
col_b <- c("magnet_dumbbell_y", "magnet_dumbbell_x" , "roll_dumbbell" ,
"magnet_belt_y" , "accel_dumbbell_x" )
col_c <- c("magnet_dumbbell_y", "roll_dumbbell" , "accel_dumbbell_y" ,
"magnet_dumbbell_x", "magnet_dumbbell_z")
col_d <- c("pitch_forearm" , "magnet_arm_y" , "magnet_forearm_x",
"accel_dumbbell_y", "accel_forearm_x")
col_e <- c("magnet_belt_y" , "magnet_belt_z" , "roll_belt",
"gyros_belt_z" , "magnet_dumbbell_y")
final_cols <- unique(c(col_a, col_b, col_c, col_d, col_e))
Anaylsis2 <- Anaylsis %>% select(all_of(final_cols), classe)
data.frame("arm" = sum(grepl("_arm",final_cols)),
"forearm" = sum(grepl("_forearm",final_cols)),
"belt" = sum(grepl("_belt",final_cols)),
"dumbbell" = sum(grepl("_dumbbell",final_cols)))
## arm forearm belt dumbbell
## 1 2 4 4 7
TrainSet <- TrainSet %>% select(all_of(final_cols), classe)
DatValue <- DatValue %>% select(all_of(final_cols), classe)
TrainSet[, 1:17] <- sapply(TrainSet[, 1:17], as.numeric)
DatValue[, 1:17] <- sapply(DatValue[, 1:17], as.numeric)
levels <- c("A", "B", "C", "D", "E")
preprop_obj <- preProcess(TrainSet[,-18],method = c("center","scale","BoxCox"))
xTrain <- predict(preprop_obj,select(TrainSet,-classe))
yTrain <- factor(TrainSet$classe,levels=levels)
xVal <- predict(preprop_obj,select(DatValue,-classe))
yVal <- factor(DatValue$classe,levels=levels)
trControl <- trainControl(method="cv", number=5)
The dumbbell sensor contributes the most selected features (7 of 17), making it the most informative of the four sensors.
my_dens <- function(data, mapping, ...) {
ggplot(data = data, mapping=mapping) +
geom_density(..., alpha = 0.3)+scale_fill_brewer(palette="Set2")
}
my_point <- function(data, mapping, ...) {
ggplot(data = data, mapping=mapping) +
geom_point(..., alpha = 0.1)+ scale_fill_brewer(palette="Set2")
}
ggpairs(Anaylsis2, columns = 1:5,aes(color = classe),
lower = list(continuous = my_point),diag = list(continuous = my_dens))
ggpairs(Anaylsis2, columns = 6:10,aes(color = classe),
lower = list(continuous = my_point),diag = list(continuous = my_dens))
ggpairs(Anaylsis2, columns = 11:17,aes(color = classe),
lower = list(continuous = my_point),diag = list(continuous = my_dens))
Most of the features are heavily skewed, so as a preprocessing step we center and rescale the features and apply a Box-Cox transform (done above with preProcess).
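For intuition on why the Box-Cox step matters: centering and scaling only shift and stretch a distribution, while Box-Cox reshapes a skewed one toward symmetry. A self-contained base-R illustration on toy data (not the sensor features; Box-Cox with lambda = 0 is exactly the log transform):

```r
# Toy illustration: a log-normal sample is heavily right-skewed;
# Box-Cox with lambda = 0 (i.e. the log transform) makes it symmetric
set.seed(42)
x <- rlnorm(1000, meanlog = 0, sdlog = 1)

# moment-based sample skewness
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skewness(x)       # large and positive (long right tail)
skewness(log(x))  # close to zero after the transform
```

caret's preProcess estimates a suitable lambda per feature automatically, so we do not have to pick the transform by hand.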
modelRF <- train(x = xTrain,y = yTrain,
method = "rf", trControl = trControl,verbose=FALSE, metric = "Accuracy")
confusionMatrix(predict(modelRF,xVal),yVal)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1112 8 0 0 0
## B 3 741 6 0 1
## C 1 7 675 15 4
## D 0 3 3 627 1
## E 0 0 0 1 715
##
## Overall Statistics
##
## Accuracy : 0.9865
## 95% CI : (0.9824, 0.9899)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9829
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9964 0.9763 0.9868 0.9751 0.9917
## Specificity 0.9971 0.9968 0.9917 0.9979 0.9997
## Pos Pred Value 0.9929 0.9867 0.9615 0.9890 0.9986
## Neg Pred Value 0.9986 0.9943 0.9972 0.9951 0.9981
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2835 0.1889 0.1721 0.1598 0.1823
## Detection Prevalence 0.2855 0.1914 0.1789 0.1616 0.1825
## Balanced Accuracy 0.9968 0.9866 0.9893 0.9865 0.9957
plot(modelRF$finalModel, main = "Error vs. number of trees")
The Random Forest achieves over 98% accuracy on the validation set, the highest of the three models tried.
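To see which of the 17 selected predictors drive the forest, caret's varImp() could be applied to the fitted model; a sketch, assuming the modelRF object trained above (not run here):

```r
# Importance of each predictor in the fitted random forest
rfImp <- varImp(modelRF)
print(rfImp)
plot(rfImp, top = 10)  # the ten most important predictors
```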
modelCT <- train(x = xTrain,y = yTrain,
method = "rpart", trControl = trControl)
confusionMatrix(predict(modelCT,xVal),yVal)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1003 330 319 294 106
## B 19 256 20 109 103
## C 93 173 345 240 212
## D 0 0 0 0 0
## E 1 0 0 0 300
##
## Overall Statistics
##
## Accuracy : 0.4853
## 95% CI : (0.4696, 0.5011)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3271
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8987 0.33729 0.50439 0.0000 0.41609
## Specificity 0.6263 0.92067 0.77833 1.0000 0.99969
## Pos Pred Value 0.4888 0.50493 0.32455 NaN 0.99668
## Neg Pred Value 0.9396 0.85275 0.88147 0.8361 0.88377
## Prevalence 0.2845 0.19347 0.17436 0.1639 0.18379
## Detection Rate 0.2557 0.06526 0.08794 0.0000 0.07647
## Detection Prevalence 0.5231 0.12924 0.27097 0.0000 0.07673
## Balanced Accuracy 0.7625 0.62898 0.64136 0.5000 0.70789
The classification tree's accuracy (about 49%) is far too low.
modelSVM <- svm(x = xTrain,y = yTrain,
kernel = "polynomial", cost = 10)
confusionMatrix(predict(modelSVM,xVal),yVal)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1096 40 18 17 2
## B 1 676 15 5 6
## C 9 40 640 45 3
## D 10 3 9 575 9
## E 0 0 2 1 701
##
## Overall Statistics
##
## Accuracy : 0.9401
## 95% CI : (0.9322, 0.9473)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9241
##
## Mcnemar's Test P-Value : 1.808e-15
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9821 0.8906 0.9357 0.8942 0.9723
## Specificity 0.9726 0.9915 0.9701 0.9905 0.9991
## Pos Pred Value 0.9344 0.9616 0.8684 0.9488 0.9957
## Neg Pred Value 0.9927 0.9742 0.9862 0.9795 0.9938
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2794 0.1723 0.1631 0.1466 0.1787
## Detection Prevalence 0.2990 0.1792 0.1879 0.1545 0.1795
## Balanced Accuracy 0.9773 0.9411 0.9529 0.9424 0.9857
The SVM performs well (about 94% accuracy), but the Random Forest is still better. From the above analysis it is clear that the Random Forest is the best-performing model, so we use it for the final predictions.
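The remaining step of the assignment is to score the 20 held-out test cases. A sketch, assuming the final_cols, preprop_obj, and modelRF objects created above (not run here):

```r
# Reapply the identical column selection and preprocessing, then predict
xTest <- TestSet %>% select(all_of(final_cols))
xTest[, 1:17] <- sapply(xTest[, 1:17], as.numeric)
xTest <- predict(preprop_obj, xTest)

predict(modelRF, xTest)
```

The test set must go through exactly the same pipeline as the training data, including the centering, scaling, and Box-Cox parameters estimated on the training set only.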